Time-Sensitive Bandit Learning and Satisficing Thompson Sampling

Authors

  • Daniel Russo
  • David Tse
  • Benjamin Van Roy
Abstract

The literature on bandit learning and regret analysis has focused on contexts where the goal is to converge on an optimal action in a manner that limits exploration costs. One shortcoming imposed by this orientation is that it does not treat time preference in a coherent manner. Time preference plays an important role when the optimal action is costly to learn relative to near-optimal actions. This limitation has not only restricted the relevance of theoretical results but has also influenced the design of algorithms. Indeed, popular approaches such as Thompson sampling and UCB can fare poorly in such situations. In this paper, we consider discounted rather than cumulative regret, where a discount factor encodes time preference. We propose satisficing Thompson sampling – a variation of Thompson sampling – and establish a strong discounted regret bound for this new algorithm.
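The abstract does not spell out the algorithm, so the following is a minimal sketch of one natural reading: an epsilon-satisficing variant of Thompson sampling on a Bernoulli bandit with independent Beta(1, 1) priors, tracking discounted rather than cumulative regret. The satisficing rule used here (play the lowest-indexed arm whose posterior sample is within epsilon of the sampled optimum) and all parameter choices are illustrative assumptions, not necessarily the paper's exact construction.

```python
import numpy as np

def satisficing_thompson_sampling(true_means, epsilon=0.05, delta=0.95,
                                  horizon=2000, seed=0):
    """Sketch of an epsilon-satisficing variant of Thompson sampling on a
    Bernoulli bandit, tracking discounted regret as in the abstract.

    Assumptions (not from the paper): independent Beta(1, 1) priors, and a
    satisficing rule that plays the lowest-indexed arm whose sampled mean
    is within epsilon of the sampled optimum.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    successes = np.ones(k)   # Beta posterior alpha parameters
    failures = np.ones(k)    # Beta posterior beta parameters
    best = max(true_means)
    discounted_regret = 0.0
    for t in range(horizon):
        theta = rng.beta(successes, failures)      # one posterior sample per arm
        acceptable = np.nonzero(theta >= theta.max() - epsilon)[0]
        arm = acceptable[0]                        # satisfice: first acceptable arm
        reward = rng.binomial(1, true_means[arm])  # Bernoulli feedback
        successes[arm] += reward
        failures[arm] += 1 - reward
        discounted_regret += (delta ** t) * (best - true_means[arm])
    return discounted_regret

print(satisficing_thompson_sampling([0.3, 0.5, 0.52]))
```

With two near-optimal arms (0.5 and 0.52), the satisficing rule settles quickly on an epsilon-acceptable arm instead of paying the exploration cost of resolving which is truly best, which is the behavior the discounted objective rewards.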


Similar Articles

Satisficing in Time-Sensitive Bandit Learning

Much of the recent literature on bandit learning focuses on algorithms that aim to converge on an optimal action. One shortcoming is that this orientation does not account for time sensitivity, which can play a crucial role when learning an optimal action requires much more information than near-optimal ones. Indeed, popular approaches such as upper-confidence-bound methods and Thompson samplin...


Bayesian bandits: balancing the exploration-exploitation tradeoff via double sampling

Reinforcement learning studies how to balance exploration and exploitation in real-world systems, optimizing interactions with the world while simultaneously learning how the world works. One general class of algorithms for such learning is the multi-armed bandit setting (in which sequential interactions are independent and identically distributed) and the related contextual bandit case, in which...
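The snippet truncates before describing the double-sampling procedure itself, so no attempt is made to reproduce it here; for orientation, this is the standard Beta-Bernoulli Thompson sampling loop that this line of work starts from.

```python
import numpy as np

def thompson_sampling(true_means, horizon=1000, seed=0):
    """Standard Beta-Bernoulli Thompson sampling: draw one sample from each
    arm's posterior, play the argmax, update with the observed reward.
    (Baseline only; the paper's double-sampling method is not shown.)"""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha, beta = np.ones(k), np.ones(k)
    total_reward = 0.0
    for _ in range(horizon):
        arm = int(np.argmax(rng.beta(alpha, beta)))  # explore via posterior sampling
        r = rng.binomial(1, true_means[arm])
        alpha[arm] += r
        beta[arm] += 1 - r
        total_reward += r
    return total_reward

print(thompson_sampling([0.2, 0.4, 0.6]))
```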


Deep Bayesian Bandits Showdown: An Empirical Comparison of Bayesian Deep Networks for Thompson Sampling

Recent advances in deep reinforcement learning have made significant strides in performance on applications such as Go and Atari games. However, developing practical methods to balance exploration and exploitation in complex domains remains largely unsolved. Thompson Sampling and its extension to reinforcement learning provide an elegant approach to exploration that only requires access to post...
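The snippet cuts off at "access to post[erior]". To make concrete what posterior access means in the simplest tractable case, here is a hedged sketch of Thompson sampling with an exact conjugate Bayesian linear-regression posterior for a contextual bandit; the model, prior, and noise scale are illustrative assumptions, not the benchmark code of the paper.

```python
import numpy as np

class LinearTS:
    """Thompson sampling with a conjugate Bayesian linear-regression posterior.

    Assumes rewards r = x^T w + noise with known noise variance sigma2 and a
    N(0, (sigma2/lam) I) prior on w, so the posterior over w is Gaussian with
    mean A^{-1} b and covariance sigma2 * A^{-1}, where A = lam*I + sum(x x^T)
    and b = sum(r x). Illustrative sketch only.
    """

    def __init__(self, dim, lam=1.0, sigma2=0.25, seed=0):
        self.A = lam * np.eye(dim)   # posterior precision (up to 1/sigma2)
        self.b = np.zeros(dim)       # accumulated r * x
        self.sigma2 = sigma2
        self.rng = np.random.default_rng(seed)

    def choose(self, contexts):
        """contexts: (n_arms, dim) array; returns the index of the arm to play."""
        cov = self.sigma2 * np.linalg.inv(self.A)
        mean = np.linalg.solve(self.A, self.b)
        w = self.rng.multivariate_normal(mean, cov)  # one posterior sample
        return int(np.argmax(contexts @ w))

    def update(self, x, r):
        self.A += np.outer(x, x)
        self.b += r * x
```

When the posterior is only approximate (as with deep networks), the quality of the samples drawn in `choose` is exactly what empirical comparisons of this kind stress-test.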


Bootstrapped Thompson Sampling and Deep Exploration

This technical note presents a new approach to carrying out the kind of exploration achieved by Thompson sampling, but without explicitly maintaining or sampling from posterior distributions. The approach is based on a bootstrap technique that uses a combination of observed and artificially generated data. The latter serves to induce a prior distribution which, as we will demonstrate, is critic...
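Based only on the mechanism the snippet describes (a bootstrap over a pool of observed plus artificially generated data), a minimal sketch for a Bernoulli bandit might look as follows; the form and amount of the artificial data are assumptions made for illustration.

```python
import numpy as np

def bootstrapped_ts_step(histories, rng, n_fake=2):
    """One decision of a bootstrapped Thompson-sampling-style rule.

    histories: list of 1-D arrays of observed 0/1 rewards, one per arm.
    For each arm, observed rewards are pooled with a few artificial
    observations (here one success and one failure, an illustrative choice
    standing in for the prior-inducing fake data the note describes), a
    bootstrap resample of the pool is drawn, and the arm with the highest
    resampled mean is played.
    """
    scores = []
    for h in histories:
        pool = np.concatenate([h, np.array([1.0, 0.0] * (n_fake // 2))])
        resample = rng.choice(pool, size=pool.size, replace=True)
        scores.append(resample.mean())
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
histories = [np.array([1.0, 0.0, 1.0]), np.array([0.0]), np.array([])]
print(bootstrapped_ts_step(histories, rng))
```

Note that no explicit posterior is ever formed or sampled; the randomness of the bootstrap resample plays the role that posterior sampling plays in ordinary Thompson sampling.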


Analysis of Thompson Sampling for Stochastic Sleeping Bandits

We study a variant of the stochastic multi-armed bandit problem where the set of available arms varies arbitrarily with time (also known as the sleeping bandit problem). We focus on the Thompson Sampling algorithm and consider a regret notion defined with respect to the best available arm. Our main result is an O(log T) regret bound for Thompson Sampling, which generalizes a similar bound known ...
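Restricting Thompson sampling to the arms available in each round is straightforward to sketch, with per-round regret measured against the best available arm as in the abstract; the Beta-Bernoulli model and availability sequence below are illustrative assumptions.

```python
import numpy as np

def sleeping_ts(true_means, availability, seed=0):
    """Thompson sampling over time-varying available-arm sets.

    true_means: length-k list of Bernoulli means (illustrative environment).
    availability: iterable of integer index arrays, the arms available
    each round. Regret each round is against the best *available* arm,
    matching the regret notion in the abstract. Beta(1, 1) priors assumed.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha, beta = np.ones(k), np.ones(k)
    regret = 0.0
    for avail in availability:
        theta = rng.beta(alpha[avail], beta[avail])  # sample only available arms
        arm = avail[int(np.argmax(theta))]
        reward = rng.binomial(1, true_means[arm])
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += max(true_means[a] for a in avail) - true_means[arm]
    return regret

avail_seq = [np.array([0, 1]), np.array([1, 2]), np.array([0, 2])] * 500
print(sleeping_ts([0.3, 0.5, 0.7], avail_seq))
```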



Journal title:
  • CoRR

Volume: abs/1704.09028  Issue: -

Pages: -

Publication date: 2017